Analysis of Data Quality
We found no significant issues in data quality that would inhibit analysis, with the possible exception of geo-location pertaining to certain types of crimes.
The dataset has 24 variables and approximately 5.6 million rows, each listing a single complaint or event. The Report Date (the case reporting time) ranges from 2006-01-01 to 2016-12-31. In this section, we will investigate the patterns of missing data as well as potential errors in the data in order to determine how well we can trust the data we are using.
Missing-Value Patterns Using visna
A summary graphic of the missing-value patterns is shown below:
visna(crime_df,sort="b")

From the visna plot above, our observations about the missing-value patterns are as follows:
- Five of the 24 variables have no data quality concerns:
- Complaint Number
- Report Date
- Offense Classification Code
- Level of Offense
- Jurisdiction responsible
- AtptCptdStatus indicates whether the crime was attempted or completed. Only 7 cases are missing; 5,483,869 are coded as completed and 96,159 as attempted.
- ParkName is recorded if the event occurred in a park. Most cases lack this variable simply because most crimes did not occur in parks, so we cannot estimate the percentage of genuinely missing park data.
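The per-variable missing counts summarized by the visna plot can be cross-checked directly. A minimal base-R sketch, using a small toy data frame standing in for crime_df:

```r
# Toy stand-in for crime_df: two variables with NAs, one complete
toy <- data.frame(
  AtptCptdStatus = c("COMPLETED", NA, "ATTEMPTED", "COMPLETED"),
  ParkName       = c(NA, NA, "CENTRAL PARK", NA),
  ComplaintNum   = 1:4
)
# Missing count per variable, sorted descending (mirrors visna's sort="b")
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_counts
```

On the real data, colSums(is.na(crime_df)) gives the same counts behind the visna rows.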
Missing Start-Date of Crime Reported
#get the reporting dates of cases with starting dates missing and histogram over DateReport
crime_df%>%select(DateStart,DateReport)%>%filter(is.na(DateStart))%>%select(DateReport)->tmp1
ggplot(tmp1,aes(DateReport))+ geom_histogram(bins=120)->p1
#compare pattern of crime Level between all cases vs. cases with missing DateStart
crime_df%>%select(DateStart,Level)%>%filter(is.na(DateStart))%>%select(Level)%>%group_by(Level)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp3; nr1=nrow(tmp3)
tmp3%>%mutate(type=replicate(nr1,"DateStart_NA"))->tmp3
crime_df%>%select(Level)%>%group_by(Level)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp5; nr2=nrow(tmp5)
tmp5%>%mutate(type=replicate(nr2,"All"))->tmp5
rbind(tmp3,tmp5)->tmp3tmp5
tmp3tmp5%>%ggplot(aes(Level,RelFreq))+geom_bar(stat="identity")+scale_x_discrete(label=abbreviate)+
coord_flip()+xlab("Level of Offense")+ylab("Relative Frequency")+facet_wrap(~type)->p2
#compare pattern of ClassCode between all cases vs. cases with missing DateStart
crime_df%>%select(DateStart,ClassCode)%>%filter(is.na(DateStart))%>%select(ClassCode)%>%mutate(ClassCode=as.factor(ClassCode))%>%group_by(ClassCode)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp2; nrr1=nrow(tmp2)
tmp2%>%mutate(type=replicate(nrr1,"DateStart_NA"))->tmp2
crime_df%>%select(ClassCode)%>%mutate(ClassCode=as.factor(ClassCode))%>%group_by(ClassCode)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp4; nrr2=nrow(tmp4)
tmp4%>%mutate(type=replicate(nrr2,"All"))->tmp4
rbind(tmp2,tmp4)->tmp2tmp4
tmp2tmp4%>%ggplot(aes(ClassCode,RelFreq),na.rm=FALSE)+geom_bar(stat="identity")+theme(text = element_text(size=5))+
coord_flip()+xlab("Offense Classification Code")+ylab("Relative Frequency")+facet_wrap(~type)->p3
grid.arrange(p1,p2,p3,nrow=1)

There are a total of 655 complaints missing the crime Start Date. The missing data is distributed the same as the non-missing data, so we are not concerned about this random-appearing missingness.
When looking at the Date of the Report against the volume of missing Start Dates, we notice that there are far more missing Start Dates in the earlier time periods, with the volume leveling off after 2010. The frequency of missing Start Dates by Level of Crime looks the same as for the non-missing data, and the frequency distribution of the Offense Code shares the same pattern as well.
Errors in Start-Date of Crime Reported
crime_df%>%select(DateStart,DateEnd,DateReport)->df_3DT
df_3DT%>%filter(DateStart<=as.Date("1900-01-01"))->df_3DT_Year1900
df_3DT%>%filter(DateStart>=as.Date("1900-01-01") & DateStart<=as.Date("1920-01-01"))->df_3DT_Year1900to1920
df_3DT%>%filter(DateStart>=as.Date("1920-01-01") & DateStart<=as.Date("1960-01-01"))->df_3DT_Year1920to1960
df_3DT%>%filter(DateStart>=as.Date("1960-01-01") & DateStart<=as.Date("1980-01-01"))->df_3DT_Year1960to1980
#association between report date and complaint date indicating possible typo in recording the data
splom(df_3DT_Year1900,varname.cex = .5,axis.text.cex = 0.5,cex=.5,xlab=NULL)->pl1
splom(df_3DT_Year1900to1920,varname.cex = .5,axis.text.cex = .5,cex=.5,xlab=NULL)->pl2
splom(df_3DT_Year1920to1960,varname.cex = .5,axis.text.cex = .5,cex=.5,xlab=NULL)->pl3
splom(df_3DT_Year1960to1980,varname.cex = .5,axis.text.cex = .5,cex=.5,xlab=NULL)->pl4
grid.arrange(pl1,pl2,pl3,pl4,nrow=2)

- There appear to be errors in DateStart. Some cases show a year of 1015 (needless to say, a time frame not covered by this dataset). Comparing those Start Dates to the Dates of the Report, the two usually share a very close month and day, and the DateEnd variable also suggests the actual year is 2015; hence, a typo.
- The scatterplot of DateStart vs. DateReport shows a strict linear relationship for many cases during some periods.
- As shown in the figure, the number of such cases is very small. In our main analysis, we will focus on the more than 5.57 million cases with DateStart between Jan. 1, 2000 and Dec. 31, 2016; the 1,549 cases with DateStart earlier than Jan. 1, 2000 will be ignored.
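One way such century typos could be repaired (assuming, as the plots suggest, that the month and day are correct and only the year is mistyped) is to borrow the year from DateReport whenever the gap between the two dates is implausibly large. A hedged base-R sketch; repair_start_year and the 120-year cutoff are our own illustrative choices, not part of the analysis above:

```r
repair_start_year <- function(start, report, max_gap_years = 120) {
  start  <- as.Date(start)
  report <- as.Date(report)
  gap <- as.numeric(format(report, "%Y")) - as.numeric(format(start, "%Y"))
  bad <- !is.na(gap) & gap > max_gap_years
  # Rebuild suspect dates with the report year, keeping the original month/day
  start[bad] <- as.Date(paste0(format(report[bad], "%Y"),
                               format(start[bad], "-%m-%d")))
  start
}
# A "year 1015" case reported in 2015 is pulled back to 2015
fixed <- repair_start_year("1015-06-12", "2015-06-14")
```

DateEnd could serve equally well as the donor of the correct year.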
Missing Premises Description (PremDesc)
#get the reporting dates of cases with PremDesc missing and histogram over DateReport
#crime_df%>%filter(is.na(PremDesc))%>%dplyr::summarise(count=n()) #35198 missing cases
crime_df%>%select(PremDesc,DateReport)%>%filter(is.na(PremDesc))%>%select(DateReport)->tmp1
ggplot(tmp1,aes(DateReport))+ geom_histogram(bins=120)->p1
#compare pattern of crime Level between all cases vs. cases with missing PremDesc
crime_df%>%select(PremDesc,Level)%>%filter(is.na(PremDesc))%>%select(Level)%>%group_by(Level)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp3; nr1=nrow(tmp3)
tmp3%>%mutate(type=replicate(nr1,"PremDesc_NA"))->tmp3
crime_df%>%select(Level)%>%group_by(Level)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp5; nr2=nrow(tmp5)
tmp5%>%mutate(type=replicate(nr2,"All"))->tmp5
rbind(tmp3,tmp5)->tmp3tmp5
tmp3tmp5%>%ggplot(aes(Level,RelFreq))+geom_bar(stat="identity")+scale_x_discrete(label=abbreviate)+
coord_flip()+xlab("Level of Offense")+ylab("Relative Frequency")+facet_wrap(~type)->p2
#compare pattern of ClassCode between all cases vs. cases with missing PremDesc
crime_df%>%select(PremDesc,ClassCode)%>%filter(is.na(PremDesc))%>%select(ClassCode)%>%mutate(ClassCode=as.factor(ClassCode))%>%group_by(ClassCode)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp2; nrr1=nrow(tmp2)
tmp2%>%mutate(type=replicate(nrr1,"PremDesc_NA"))->tmp2
crime_df%>%select(ClassCode)%>%mutate(ClassCode=as.factor(ClassCode))%>%group_by(ClassCode)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))->tmp4; nrr2=nrow(tmp4)
tmp4%>%mutate(type=replicate(nrr2,"All"))->tmp4
rbind(tmp2,tmp4)->tmp2tmp4
tmp2tmp4%>%ggplot(aes(ClassCode,RelFreq),na.rm=FALSE)+geom_bar(stat="identity")+theme(text = element_text(size=10))+
coord_flip()+xlab("Offense Classification Code")+ylab("Relative Frequency")+facet_wrap(~type)->p3
grid.arrange(p1,p2,p3,nrow=1)

We do not rely on the description of premises much in our analyses, but readers should note that approximately 0.6% of the crimes do not have a description of the premises, with a disproportionately high number of those cases pertaining to murder/manslaughter.
- There are 35,198 cases missing the description of the premises.
- Comparing the distribution of ClassCode between all cases and cases with missing PremDesc indicates that ClassCode 101, the felony of MURDER & NON-NEGL. MANSLAUGHTER, has a much higher relative frequency in the missing data.
- In 2006, there are many more cases with missing PremDesc than in other years; 2006 also had more cases overall than other years.
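The over-representation of a category among missing cases can be quantified as the ratio of its relative frequency in the missing subset to its relative frequency overall. A toy base-R sketch (the data here are made up; ClassCode 101 stands in for murder/manslaughter):

```r
# Toy data: offense codes and whether PremDesc is missing for each case
class_code   <- c(101, 101, 341, 341, 341, 341, 578, 578, 578, 101)
prem_missing <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

rel_freq_all     <- prop.table(table(class_code))
rel_freq_missing <- prop.table(table(class_code[prem_missing]))
# A ratio above 1 means the code is over-represented among cases missing PremDesc
overrep_101 <- rel_freq_missing[["101"]] / rel_freq_all[["101"]]
```

On the real data this is the same comparison the faceted bar charts above make visually.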
Mismatch between Precinct (Pct) and Borough (Boro)
crime_df %>% select(Pct,Boro)%>%group_by(Pct,Boro)%>%drop_na()%>%dplyr::summarise(count=n())->tmp1
uniPct<-unique(tmp1$Pct)
for (i in 1:length(uniPct)) {
if(nrow(tmp1%>%filter(Pct==(uniPct)[i])) > 1){
print(tmp1%>%filter(Pct==(uniPct)[i]))
}
}
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 6 BRONX 1
## 2 6 MANHATTAN 59559
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 7 BROOKLYN 1
## 2 7 MANHATTAN 45259
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 9 BROOKLYN 1
## 2 9 MANHATTAN 67822
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 13 BROOKLYN 1
## 2 13 MANHATTAN 81145
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 14 BROOKLYN 1
## 2 14 MANHATTAN 129697
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 23 BRONX 3
## 2 23 MANHATTAN 73154
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 25 BRONX 1
## 2 25 MANHATTAN 74073
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 26 BROOKLYN 1
## 2 26 MANHATTAN 37213
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 71 BRONX 1
## 2 71 BROOKLYN 78909
## # A tibble: 3 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 104 BROOKLYN 1
## 2 104 MANHATTAN 1
## 3 104 QUEENS 81151
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 106 BROOKLYN 1
## 2 106 QUEENS 67367
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 114 BRONX 2
## 2 114 QUEENS 100798
## # A tibble: 2 x 3
## # Groups: Pct [1]
## Pct Boro count
## <int> <fct> <int>
## 1 121 BROOKLYN 1
## 2 121 STATEN ISLAND 23804
There are a very small number of cases where Precinct or Borough are missing (0.007% and 0.008% of all cases, respectively). There is an even smaller number of cases where the Precinct shows an incorrect Borough.
- There are a total of 77 distinct police precincts in NYC. The list above shows cases where the Borough listed for a Precinct is inconsistent with the vast majority of cases for that Precinct.
- There are about 16 cases where the Precinct shown is inconsistent with the Borough in this manner. For each affected Precinct, the number of cases with an inconsistent Borough name ranges from 1 to 3. This problem can be fixed by correcting the Borough names of those rare cases using a map of Pct to Boro established from the dataset.
- 390 cases have a missing Pct. 463 cases have a missing Boro.
#match_pct_boro can be used to fix the problem of double/triple borough names of specific Pct
crime_df %>% select(Pct,Boro)%>%group_by(Pct,Boro)%>%drop_na()%>%dplyr::summarise(count=n())%>%
group_by(Pct)%>%dplyr::summarise(Boro=Boro[count==max(count)])->match_pct_boro
#Merging onto crime_df so that Pct and Boro have a 1-to-1 correspondence takes time (so it is commented out).
#Another way is to perform the merge after you have manipulated your data into a small table, i.e. only merge when needed;
#replace "crime_df%>%select(-Boro)" in the line below with whatever table you need to match Pct with Boro,
#and replace "crime_df" with any new table you want to create.
#merge(crime_df%>%select(-Boro),match_pct_boro, by.x="Pct",by.y = "Pct",all.x=TRUE)->crime_df
Missing Description of Offense (OffenseDesc)
#match_code_desc can be used to retrieve OffenseDesc(when missing) from ClassCode
crime_df%>%select(ClassCode,OffenseDesc)%>%group_by(ClassCode)%>%
dplyr::summarise(OffenseDesc=paste(unique(OffenseDesc),collapse=","))%>%
mutate(OffenseDesc=str_replace(OffenseDesc,",NA",""))%>%mutate(OffenseDesc=str_replace(OffenseDesc,"NA,",""))->match_code_desc
#For cases with missing OffenseDesc, how they distribute over the code ClassCode
crime_df%>%
select(ClassCode,OffenseDesc)%>%
filter(is.na(OffenseDesc))%>%
group_by(ClassCode)%>%dplyr::summarise(count=n())->tmp1
#showing the supposed OffenseDesc that is missing with its ClassCode
merge(tmp1,match_code_desc,by.x="ClassCode",by.y="ClassCode")%>%arrange(dplyr::desc(count))->match_bycount
par(mgp=c(1,0.3,0),mai=c(0.4,1.8,0.01,0.01))
data2<-match_bycount[order(match_bycount[,"count"]),]
barplot(data2[,"count"],names.arg=abbreviate(data2[,"OffenseDesc"],minlength=20),cex.names = 0.6,cex.axis=0.7,cex.lab=0.8,horiz=TRUE,xlim=c(0,17500),las=1,xlab="Count")

#Again, merging is commented out, perform it when needed
#merge(crime_df%>%select(-OffenseDesc),match_code_desc, by.x="ClassCode",by.y = "ClassCode",all.x=TRUE)->crime_df
Since the description of the offense can be inferred from the classification code, we should not be concerned about the missing values in this variable.
- OffenseDesc (which has missing values) is the description of the offense corresponding to the key code ClassCode (which is complete in the dataset). Code and description map to each other, so a valid OffenseDesc can be inferred from a map established from the dataset. The plot above shows the counts, by ClassCode category, of cases whose OffenseDesc was originally missing but has now been retrieved from the ClassCode-to-OffenseDesc map.
Missing Geolocation
#For cases with missing geolocation (using Lat here), how they distribute over the code ClassCode
crime_df%>%
select(ClassCode,Lat)%>%
filter(is.na(Lat))%>%
group_by(ClassCode)%>%dplyr::summarise(count=n())->tmp1
merge(tmp1,match_code_desc,by.x="ClassCode",by.y="ClassCode")%>%arrange(desc(count))->match_bycount
#par(mar=c(4.1,15.1,2.1,2.1))
par(mgp=c(1,0.2,0),mai=c(0.4,2.5,0.01,0.5))
data2<-match_bycount[order(match_bycount[,"count"]),]
barplot(data2[,"count"],names.arg=abbreviate(data2[,"OffenseDesc"],minlength=20),horiz=TRUE,cex.names = 0.4,cex.axis=0.7,cex.lab=0.8,xlim=c(0,50000),las=1,xlab="Count")

- The 5 geolocation variables share the same missing pattern, as expected, so we only need to examine one of them. The data documentation states that "to protect victim identities, rape and sex crime offenses are not geocoded". We want to see whether the missing geolocation values are mostly related to those crimes, or whether other crimes have substantial missingness too.
- The missingness in geolocation is clearly not random. When examining the spatial pattern of the crimes, we have to bear in mind that particular crimes will not appear on the map because their locations are missing not at random.
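Whether the geocode gaps line up with the documented rape/sex-offense exclusion can be checked by tabulating the missing rate per offense. A base-R sketch on made-up data:

```r
# Toy data: offense descriptions and latitudes (NA = not geocoded)
offense <- c("RAPE", "RAPE", "ROBBERY", "ROBBERY", "ROBBERY",
             "SEX CRIMES", "SEX CRIMES")
lat     <- c(NA, NA, 40.71, 40.65, NA, NA, NA)
# Fraction of cases missing a geocode, per offense category
miss_rate <- tapply(is.na(lat), offense, mean)
miss_rate
```

A missing rate of exactly 1 for the protected categories, against a small background rate elsewhere, is the missing-not-at-random signature described above.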
Main Analysis
Crime in New York is a rich and varied topic, a topic which allowed us to explore many angles and use many techniques. Below, you will find graphs, and the insights derived from them, organized around the concepts of What, Where, When and Why.
What: What kinds of crimes have been happening?
- Breakdown of crime by Level (Misdemeanors, Felonies, then Violations)
First, we will examine the broadest categorization of crime: the legal level of the offense. In the raw dataset, this field is called “LAW_CAT_CD”, and we have shortened that to Level.
# Remove Invalid Dates
crime_df <- crime_df %>% filter(year(DateStart)>2005)
crime_level <- crime_df %>%
group_by(Level) %>%
summarize(count=n()) %>%
mutate(freq= count/sum(count)*100)
ggplot(crime_level, aes(reorder(Level, freq), freq)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("Level of offense") +
ylab("Count in Percent") +
ggtitle("Distribution of Crime Levels")

What we see in this graph is how much of crime is accounted for by Misdemeanors (around 58%), followed by Felonies (around 30%), and then Violations (12%).
- Normal distribution of daily crime level of those levels
# see shape of the daily counts... normal?
daily_df <-crime_df %>%
group_by(DateStart,Level) %>% summarize(CrimeCount=n())
ggplot(daily_df, aes(x=CrimeCount)) +
geom_density(aes(group=Level, color=Level, fill=Level), alpha=0.3) +
ggtitle("Density Curves of Daily Crime Count by Level of Crime")

Another view of the Level of crimes is on a daily basis. We can see that the more numerous the Level of crime, the more it varies from day to day. All three Levels appear roughly normally distributed, although the Misdemeanor Level has a bit of a plateau at the top, and there is a very, very long tail in the Violation data, suggesting a small number of days with record violation counts (on January 1, 2010, for example, 1,473 violations were recorded; we discuss the January 1 phenomenon below).
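The "very long tail" of the Violation counts can be made precise by flagging days whose count exceeds the bulk by several standard deviations. A toy sketch with simulated daily counts; the 4-SD threshold is our own choice, not something from the report:

```r
set.seed(1)
# Simulated daily counts: a stable baseline plus one Jan-1-style spike of 1473
daily <- c(rnorm(364, mean = 300, sd = 20), 1473)
# Flag days more than 4 standard deviations above the mean
spikes <- which(daily > mean(daily) + 4 * sd(daily))
```

Only the spike day survives the cut; the rest of the distribution behaves normally.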
- Leading types of crimes
Now let’s look a layer deeper. The dataset includes codes and descriptions that give us another level of granularity in the type of offense reported. The OffenseDesc (OFNS_DESC) tells us more.
crime_top <- crime_df %>%
filter(OffenseDesc!="") %>%
group_by(OffenseDesc) %>%
summarize(count=n()) %>%
mutate(rel_freq = count/sum(count)) %>%
top_n(n=15, wt=count)
ggplot(crime_top, aes(reorder(OffenseDesc,rel_freq), rel_freq)) +
geom_col() +
coord_flip() +
xlab("Offense Category") +
ylab("Crime Count Frequency") +
ggtitle(" Top Crime Offense Categories")

We can see from this graph that Petit Larceny accounts for the largest number of crimes in New York City, followed, but not very closely, by Harrassment 2, and then Assault 3 & Related Offenses.
In examining this list, two things jumped out at us. Assault, third on the list and over 10% of all crime, seems quite serious; because of this (and because violent crime is the most frightening kind when you are concerned about crime), we decided to take a closer look at all violent crimes. Dangerous Drugs, sixth on the list, is another category of note, particularly given how quickly trends in drug abuse reach the news and the number of states that have legalized marijuana use.
As such, we intend to examine those categories of crime in addition to trends by Level.
Where: Where does crime take place?
The question of where crime happens has multiple perspectives of import:
- Where would I choose to live or work to avoid crime?
- Should we adjust policing strategies to try to reduce crime in high crime areas?
Borough Analysis
What Level is to crime as a whole, Borough is to the City of New York. The first question about location is the most macro: how does crime differ across the five Boroughs?
- Total vs. Per Capita by Level
# bring in Borough Population and massage it
bdf <- fread("../Data_Files/BoroughPop.csv")
bdf <- bdf[1:6,]
bdf$Boro <- c("TOTAL","BRONX","BROOKLYN","MANHATTAN","QUEENS","STATEN ISLAND")
# summarize for mosaic, per capita plots
df_bsum <-crime_df %>%
filter(!is.na(Boro)) %>%
group_by(Boro,Level) %>%
summarize(Freq = n())
# merge in the borough population
df_bsum <- merge(df_bsum, bdf, by="Boro")
# per capita calculation
df_bsum$PerCap <-df_bsum$Freq/df_bsum$`2016 Estimate`
# Mosaic By Count"
mosaic(Level~Boro,df_bsum, direction=c("v","h"), main="Crime by Borough by Level", labeling=labeling_border(rot_labels=c(15,0,0, 0), offset_labels = c(0,0,0,2.2), offset_varnames = c(1,0,0,5.0), just_labels=c("left", "left", "left", "center")))

The first thing we can see is that there is more crime in Brooklyn than any other Borough, and the amount of crime in Staten Island is very small. However, crime is committed by people, and the number of people in each Borough is different. Hence, we should look at Crime Per Capita.
# By Per Capita -- you have to have "Freq" be the column for the thing the Mosaic will use for frequency, so
# for Per Capita, you need to swap the Freq column names
colnames(df_bsum)[colnames(df_bsum)=="Freq"] <- "Count"
colnames(df_bsum)[colnames(df_bsum)=="PerCap"] <- "Freq"
mosaic(Level~Boro,df_bsum, direction=c("v","h"), main="Crime per Capita by Borough by Level",labeling=labeling_border(rot_labels=c(15,0,0, 0), offset_labels = c(0,0,0,2.2), offset_varnames = c(1,0,0,5.0), just_labels=c("left", "left", "left", "center")))

By looking at the 2010 Census data by Borough and comparing it to the overall crime per Borough, we see that the per-capita view tells a rather different story. Staten Island, due to its small population, actually has a higher crime rate than some other boroughs. In fact, Queens and Brooklyn see less crime per capita than the overall crime totals would have us believe.
- Crime by Level, Borough and Time
We also have official estimates of the population for 2016, so we can look at both Level and Time, comparing 2010 Per Capita to 2016 Per Capita.
# bdf already has a Boro column (created above), so no rename is needed before the merge
# limit to specific years of the population data and test
# start with 2010
# summarize for mosaic, per capita plots
df_bsum2010 <-crime_df %>%
filter(!is.na(Boro)) %>%
filter(DateStart > "2009-12-31" & DateStart < "2011-01-01") %>%
group_by(Boro,Level) %>%
summarize(Freq = n())
# merge in the borough population
df_bsum2010 <- merge(df_bsum2010, bdf, by="Boro")
# per capita calculation
df_bsum2010$PerCap <-df_bsum2010$Freq/df_bsum2010$`2010 Population`
#2010 mosaic
colnames(df_bsum2010)[colnames(df_bsum2010)=="Freq"] <- "Count"
colnames(df_bsum2010)[colnames(df_bsum2010)=="PerCap"] <- "Freq"
# now 2016 Estimate
# summarize for mosaic, per capita plots
df_bsum2016 <-crime_df %>%
filter(!is.na(Boro)) %>%
filter(DateStart > "2015-12-31" & DateStart < "2017-01-01") %>%
group_by(Boro,Level) %>%
summarize(Freq = n())
# merge in the borough population
df_bsum2016 <- merge(df_bsum2016, bdf, by="Boro")
# per capita calculation
df_bsum2016$PerCap <-df_bsum2016$Freq/df_bsum2016$`2016 Estimate`
# By Per Capita -- you have to have "Freq" be the column for the thing the Mosaic will use for frequency, so
# for Per Capita, you need to swap the Freq column names
#2016
colnames(df_bsum2016)[colnames(df_bsum2016)=="Freq"] <- "Count"
colnames(df_bsum2016)[colnames(df_bsum2016)=="PerCap"] <- "Freq"
#mosaic(Level~Boro,df_bsum2016, direction=c("v","h"), main="2016 Crime per Capita by Borough by Level")
#Plot 2010 year over 2016 year by Borough
colnames(df_bsum2010)[colnames(df_bsum2010)=="Freq"] <- "PerCap10"
colnames(df_bsum2016)[colnames(df_bsum2016)=="Freq"] <- "PerCap16"
df_bsum.pcap <- merge(df_bsum2010,df_bsum2016, by=c("Boro","Level"))
df_bsum.pcap$Count.y <- NULL
df_bsum.pcap$Borough.y <- NULL
df_bsum.pcap$"2010 Population.y" <- NULL
df_bsum.pcap$"2016 Estimate.y" <- NULL
tidy_bsum <- tidyr::gather(df_bsum.pcap, key="Year", value="PerCap", -"Boro", -"Level", -"Count.x", -"2010 Population.x", -"2016 Estimate.x", -"Borough.x")
library(ggplot2)
ggplot(tidy_bsum, aes(x=Year, y=PerCap, fill=Level))+
geom_bar(stat="identity",position="dodge") +
scale_fill_discrete(name="Year",
#breaks=c(1, 2),
labels=c("Felony", "Misdemeanor","Violation")) +
xlab("Year")+ylab("Per Capita Crime") +
facet_wrap(~Boro) +
ggtitle("Per Capita Crime Rates by Level by Borough by Time")

From these graphs, we see how there is an apparent drop in the rate of crime between 2010 and 2016, mostly driven by Misdemeanors (in every Borough, but most predominantly in the Bronx). We can see that the Felony rate has been mostly unchanged, except in Manhattan. Violations have gone up in every Borough except Staten Island. (We will explore more about crime over time in the When section.)
- Crime over two time periods
ggplot(tidy_bsum, aes(x=reorder(Boro, -PerCap), y=PerCap, fill=Year))+
geom_bar(stat="identity",position="dodge") +
scale_fill_discrete(name="Year",
#breaks=c(1, 2),
labels=c("2010 Census", "2016 Estimate")) +
xlab("Borough")+ylab("Per Capita Crime") +
facet_wrap(~Year+Level, scales = "free") +
theme(axis.text.x=element_text(angle=90,hjust=1))+
scale_fill_brewer(palette="Paired") +
ggtitle("Per Capita Crime Rates by Level by Borough by Time")

From this view, we also see that there are some notable differences between Boroughs in the per-capita crime rates. Manhattan leads the way on Felonies, followed by the Bronx; the top two are in the opposite order for Misdemeanors. But it is Staten Island and the Bronx that lead in Violations. For the lowest rates, Staten Island is lowest for Felonies, followed by Queens, while Queens is lowest for Misdemeanors and Violations. Hence, the crime profile is quite different in each Borough.
- Main Categories of Crime by Borough
We can also get more specific about the top 6 types of crime, in terms of crime in the Boroughs.
top_ofns <- c("PETIT LARCENY", "HARRASSMENT 2", "CRIMINAL MISCHIEF & RELATED OF", "ASSAULT 3 & RELATED OFFENSES", "GRAND LARCENY", "DANGEROUS DRUGS")
label_list <- c("PETIT LARCENY", "HARRASSMENT 2", "CRIMINAL MISCHIEF", "ASSAULT 3" , "GRAND LARCENY", "DANGEROUS DRUGS")
crime_sort <- crime_df %>%
filter(Boro != "") %>%
filter(OffenseDesc %in% top_ofns) %>%
group_by(Boro,OffenseDesc) %>%
summarize(Freq=n()) %>%
mutate(rel_freq = Freq/sum(Freq))
crime_sort$OffenseDesc <- factor(crime_sort$OffenseDesc)
mosaic(OffenseDesc~Boro, main ="Top Offense Distribution across Boroughs", direction=c("v"), labeling=labeling_border(rot_labels=c(15,0,0, 0), offset_labels = c(0,0,0,3.3), offset_varnames = c(1,0,0,7.6), just_labels=c("left", "left", "left", "center")), crime_sort)

Several observations can be made from this view:
1. Dangerous Drugs take up a disproportionately high share of crime in the Bronx, while in Queens, Dangerous Drugs account for far fewer of the crimes.
2. Larceny, both Petit and Grand, are a large share of crime in Manhattan when compared to other Boroughs.
3. Harassment 2 is the most prevalent of these top six in Staten Island.
#doubledecker(TOP_OFFENSE~Boro, data=crime_sort)
ggplot(crime_sort, aes(OffenseDesc,Freq)) +
geom_col() +
ylab("Offense Count") +
xlab("Offense Description") +
facet_wrap(~ Boro) +
coord_flip() +
ggtitle(" Top Offense Categories Distribution in Different Boroughs")

This view makes it easier to compare these types of crime within Boroughs. Brooklyn, for instance, has the most even distribution of crime across these categories, while Manhattan has a much higher proportion of Petit Larceny than of any other category here. Drugs and violent crime: this graph also reinforces how significantly Dangerous Drugs factor into crime in the Bronx, given that the category is nearly as prevalent as any other. It is also worth noting where Assault falls in each Borough: in Manhattan it is vastly outnumbered by each type of Larceny, it is relatively high in the Bronx and Queens, and it sits in the middle in Brooklyn.
The dataset includes a field called “PREM_TYP_DESC” (which we have shortened to PremDesc) to indicate where the crime took place, such as on the street, in a house, etc. It turns out that about 85% of crime occurs on the Streets or within a Residence of one type or another.
crime_place <- crime_df %>%
filter(!is.na(PremDesc),Level !="") %>%
group_by(PremDesc) %>%
summarize(count=n()) %>%
top_n(n=10, wt=count) %>%
mutate(rel_freq = count/sum(count))
ggplot(crime_place, aes(fct_reorder(PremDesc, rel_freq), rel_freq)) +
geom_bar(stat="identity") +
coord_flip() +
ylab("Crime Count Frequency") +
xlab("Place of Crime - Premises") +
ggtitle(" Top Ten Crime Premises")

Here we see how crime on the Street is the largest single category (nearly 40%), but if you add together the various Residence categories, Residences total 45%, vastly outnumbering the Subway, commercial buildings, etc.
crime_place_2 <- crime_df %>%
filter(!is.na(PremDesc),Level !="") %>%
group_by(Level,PremDesc) %>%
summarize(count=n()) %>%
top_n(n=10, wt=count) %>%
mutate(rel_freq = count/sum(count)) %>%
ungroup() %>%
arrange(Level, rel_freq) %>%
unite("PremLevel", PremDesc, rel_freq, sep = "_", remove = FALSE) %>%
data.frame() %>%
mutate(PremLevel = factor(PremLevel, levels = PremLevel))
ggplot(crime_place_2, aes(PremLevel, rel_freq)) +
geom_bar(stat="identity") +
facet_wrap(~Level, scales="free") +
coord_flip() +
ylab("Crime Count Frequency") +
xlab("Place of Crime - Premises") +
scale_x_discrete(breaks=crime_place_2$PremLevel,labels=crime_place_2$PremDesc) +
ggtitle(" Place of Crime Distribution Across Crime Categories")
While Felonies and Misdemeanors follow the same pattern as crime, overall, this graph shows that more Violations happen in Apartments than on the street.
Executive Summary
Interactive Component
Conclusion
#ap crime_df%>%select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD
#ap crime_df%>%select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2000)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD
#ap
#ap cbind(rbind(data.frame(Year=2000:2005,Report=NA),totalCntByRD),
#ap totalCntBySD[,2])%>%gather(key,value=count,-Year)->totalCnt
#ap ggplot(totalCnt)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank())->p1
#ap totalCnt%>%filter(Year>=2006)%>%ggplot()+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+ylim(2006,2016)+theme(legend.title=element_blank())->p2
#ap grid.arrange(p1,p2)
- The total frequency per year shows a declining trend. Part of this could be under-reporting: some cases that occurred in a given year may not yet have been reported, and we would expect later years to accumulate more reports over time. But that does not seem to explain the large decrease in the number of cases reported per year; most likely, crime really is decreasing over the years.
- The big gap between Start and Report counts in 2016 is again most likely because many cases that occurred in 2016 cannot appear in this figure, since DateReport is cut off at Dec. 31, 2016.
- The number of cases with DateStart earlier than Jan. 1, 2000 is very small.
#picking non-missing DateStart and filter only those after "2006-01-01", 5560408 obs.
crime_df%>%select(DateStart,Level)%>%filter(!is.na(DateStart))%>%filter(DateStart>=as.Date("2006-01-01"))->df_Date
#time series of daily frequency of 3 crime categories 2006-2016
df_Date%>%group_by(DateStart,Level)%>%dplyr::summarise(count=n())%>%ungroup()%>%group_by(Level)%>%mutate(mon_mean=rollmean(count,30,fill=NA))%>%ungroup()->byDateLawMean
#daily rate
byDateLawMean%>%ggplot()+
geom_line(aes(DateStart,count,color=Level))+
geom_line(aes(DateStart,mon_mean,group=Level))+
ggtitle("Daily Crime Frequency since 2006 with 30-day running mean")+
labs(x="Date",y="Frequency")+theme(legend.title=element_blank())+geom_hline(yintercept=1150)

#Top 9 daily rate falls on Jan 1.
byDateLawMean%>%arrange(dplyr::desc(count))%>%filter(count>=count[9])%>%
mutate(DateStart=as.factor(DateStart))%>%
ggplot(aes(forcats::fct_reorder(DateStart, count),count))+geom_bar(stat="identity")+coord_flip()+ylab("Top 9 Daily Crime Frequency")+xlab("Date")

- The time series of crime frequency decreases over the years, consistent with the Cleveland dot plots of total count by year.
- There is an obvious annual cycle; the 30-day running mean shows it clearly.
- There are spikes in both the misdemeanor and felony categories. The top 9 highest-frequency dates are shown in the bar chart: they fall on January 1 of almost every year in the period, the one exception (2015) following very close behind. When reported, cases tend to have their occurrence date rounded to Jan. 1.
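The January 1 rounding effect can be tested by comparing the share of complaints whose start date is January 1 with the roughly 1/365 share expected if occurrence dates were spread evenly over the year. A base-R sketch on toy dates:

```r
# Toy start dates: one uniform year plus an excess of 20 extra Jan 1 entries
dates <- c(seq(as.Date("2010-01-01"), as.Date("2010-12-31"), by = "day"),
           rep(as.Date("2010-01-01"), 20))
share_jan1   <- mean(format(dates, "%m-%d") == "01-01")
expected     <- 1 / 365
# Ratio well above 1 indicates dates piling up on Jan 1
excess_ratio <- share_jan1 / expected
```

On the real data, an excess ratio far above 1 for January 1 would confirm the rounding hypothesis rather than a genuine New Year's crime wave.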
crime_df%>%select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD
crime_df%>%select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2006)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD
cbind(totalCntByRD,totalCntBySD[,2])%>%gather(key,value=count,-Year)->totalCnt
ggplot(totalCnt)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank(),legend.position=c(0.5,0.75))+ggtitle("ALL")->p1
crime_df%>%filter(Level=="FELONY")%>%select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD
crime_df%>%filter(Level=="FELONY")%>%select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2006)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD
cbind(totalCntByRD,totalCntBySD[,2])%>%gather(key,value=count,-Year)->totalCnt
ggplot(totalCnt)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank(),legend.position=c(0.7,0.75))+ggtitle("FELONY")->p2
crime_df%>%filter(Level=="MISDEMEANOR")%>%select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD
crime_df%>%filter(Level=="MISDEMEANOR")%>%select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2006)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD
cbind(totalCntByRD,totalCntBySD[,2])%>%gather(key,value=count,-Year)->totalCnt
ggplot(totalCnt)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank(),legend.position=c(0.3,0.4))+ggtitle("MISDEMEANOR")->p3
crime_df%>%filter(Level=="VIOLATION")%>%select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD
crime_df%>%filter(Level=="VIOLATION")%>%select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2006)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD
cbind(totalCntByRD,totalCntBySD[,2])%>%gather(key,value=count,-Year)->totalCnt
ggplot(totalCnt)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank(),legend.position=c(0.5,0.7))+ggtitle("VIOLATION")->p4
grid.arrange(p1,p2,p3,p4,nrow=2)

crime_df%>%filter(OffenseDesc == "DANGEROUS DRUGS")%>% select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD_F
crime_df%>%filter(OffenseDesc == "DANGEROUS DRUGS")%>%select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2006)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD_F
cbind(totalCntByRD_F,totalCntBySD_F[,2])%>%gather(key,value=count,-Year)->totalCnt_F
ggplot(totalCnt_F)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank(),legend.position = c(0.8,0.7))+ggtitle("DANGEROUS DRUGS")->p1
crime_df%>%filter(OffenseDesc == "ASSAULT 3 & RELATED OFFENSES" |
OffenseDesc == "FELONY ASSAULT" |
OffenseDesc == "RAPE" |
OffenseDesc == "ROBBERY" |
IntOffenseDesc == "AGGRAVATED SEXUAL ASBUSE" |
IntOffenseDesc == "ASSAULT 2,1,UNCLASSIFIED" |
IntOffenseDesc == "ASSAULT 3" |
IntOffenseDesc == "RAPE 1" |
IntOffenseDesc == "ROBBERY,OPEN AREA UNCLASSIFIED" |
IntOffenseDesc == "SEXUAL ABUSE" |
IntOffenseDesc == "SEXUAL ABUSE 3,2")%>% select(DateReport)%>%mutate(Year=year(DateReport))%>%group_by(Year)%>%dplyr::summarise(Report=n())->totalCntByRD_F
crime_df%>%filter(OffenseDesc == "ASSAULT 3 & RELATED OFFENSES" |
OffenseDesc == "FELONY ASSAULT" |
OffenseDesc == "RAPE" |
OffenseDesc == "ROBBERY" |
IntOffenseDesc == "AGGRAVATED SEXUAL ASBUSE" |
IntOffenseDesc == "ASSAULT 2,1,UNCLASSIFIED" |
IntOffenseDesc == "ASSAULT 3" |
IntOffenseDesc == "RAPE 1" |
IntOffenseDesc == "ROBBERY,OPEN AREA UNCLASSIFIED" |
IntOffenseDesc == "SEXUAL ABUSE" |
IntOffenseDesc == "SEXUAL ABUSE 3,2")%>% select(DateStart)%>%mutate(Year=year(DateStart))%>%filter(Year>=2006)%>%group_by(Year)%>%dplyr::summarise(Start=n())->totalCntBySD_F
cbind(totalCntByRD_F,totalCntBySD_F[,2])%>%gather(key,value=count,-Year)->totalCnt_F
ggplot(totalCnt_F)+geom_point(aes(count,Year,color=key),size=5)+coord_flip()+theme(legend.title=element_blank(),legend.position = c(0.3,0.75))+ggtitle("Violent Crime")->p2
grid.arrange(p1,p2,nrow=2)

#frequency by month
df_Date%>%mutate(Month=as.character(month(DateStart)))%>%group_by(Month,Level)%>%dplyr::summarise(CntByMon=n())->byDateLaw_mon
byDateLaw_mon%>%mutate(Days=rep(31,3))%>%mutate(Days=ifelse(Month=="2",28,Days))%>%mutate(Days=ifelse(Month %in% c("4","6","9","11"),30,Days))->byDateLaw_mon
byDateLaw_mon%>%ggplot(aes(fct_relevel(Month,"10","11","12",after=9),CntByMon/Days))+geom_bar(stat="identity")+coord_flip()+ylab("Crime Frequency (Monthly Mean)")+facet_wrap(~Level,scales="free_x")+xlab("Month")->p1
#frequency by day
df_Date%>%mutate(Day=as.factor(format(DateStart,"%d")))%>%group_by(Day,Level)%>%dplyr::summarise(CntByDay=n())->byDateLaw_day
#Day1-28 has the same total cnts=11yrs*12cnts/yr
#Day 29 cnts=11yrs*11cnts/yr+3cnts (leap yrs)
#Day 30 cnts=11*11; Day 31 cnts=7*11
byDateLaw_day%>%mutate(cnts=rep(12*11,3))%>%mutate(cnts=ifelse(Day=="29",11*11+3,cnts))%>%mutate(cnts=ifelse(Day=="30",11*11,cnts))%>%mutate(cnts=ifelse(Day=="31",7*11,cnts))->byDateLaw_day
byDateLaw_day%>%ggplot(aes(Day,CntByDay/cnts))+geom_bar(stat="identity")+coord_flip()+ylab("Crime Frequency (Daily Mean)")+facet_wrap(~Level,scales="free_x")+xlab("Day of Month")->p2
#frequency by weekday
df_Date%>%mutate(Wkday=as.factor(weekdays(DateStart,abbreviate=TRUE)))%>%group_by(Wkday,Level)%>%dplyr::summarise(CntByWkday=n())->byDateLaw_wkday
byDateLaw_wkday%>%ggplot(aes(fct_relevel(Wkday,"Mon","Tue","Wed","Thu","Fri","Sat","Sun"),CntByWkday))+geom_bar(stat="identity")+coord_flip()+ylab("Crime Frequency")+facet_wrap(~Level,scales="free_x")+xlab("Day of Week")->p3
#picking non-missing TimeStart
crime_df%>%filter(!is.na(TimeStart))%>%filter(DateStart>=as.Date("2006-01-01"))->df_FRTM
#Frequency by hour of day, combining hour 00 and hour 24 into hour 00
df_FRTM%>%mutate(Hour=as.factor(substr(TimeStart,1,2)))%>%group_by(Hour,Level)%>%dplyr::summarise(CntByHour=n())->byDateLaw_hour
byDateLaw_hour$Hour[byDateLaw_hour$Hour=="24"]<-"00"
byDateLaw_hour$Hour<-factor(byDateLaw_hour$Hour)
byDateLaw_hour%>%ggplot(aes(Hour,CntByHour))+geom_bar(stat="identity")+coord_flip()+ylab("Crime Frequency")+facet_wrap(~Level,scales="free_x")+xlab("Hour of Day")->p4
grid.arrange(p1,p2,p3,p4,nrow=2)

- Bar-charting by month confirms that June-October is the high-crime season.
- The spike in January is consistent with the analysis above.
- Day-of-month counts suggest a tendency to round occurrence dates to multiples of 5.
- Violations are lower on weekends but roughly constant across weekdays.
- Felonies and misdemeanors peak on Friday and are lowest on Sunday and Monday.
- There is a clear daily cycle: the early morning has the fewest crime occurrences, while the late afternoon has the most.
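The "rounding to multiples of 5" observation can also be expressed as a single number. A hedged sketch: under a uniform distribution of true occurrence days, the days 1, 5, 10, ..., 30 should hold roughly 7/30.4 (about 23%) of cases; apply the function below to `lubridate::day(df_Date$DateStart)` to get the observed share.

```r
# Share of incidents whose day-of-month is a "round" day (1, 5, 10, ...).
# Under no rounding this should be near 7/30.4 (~0.23); a clearly higher
# share suggests day-of-month is rounded when the exact date is unknown.
round_day_share <- function(days) {
  mean(days %in% c(1, 5, 10, 15, 20, 25, 30))
}

# toy: heavy bunching on round days
round_day_share(c(1, 1, 5, 10, 15, 3, 22))  # 5/7 ~ 0.714
```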
#how different crime types (OffenseDesc) are associated with premises (heatmap)
crime_df%>%
select(ClassCode,PremDesc)%>%
filter(!is.na(PremDesc))%>%
group_by(ClassCode,PremDesc)%>%dplyr::summarise(count=n())%>%mutate(pct=count/sum(count))->byKYbyPREM
#merging to get OffenseDesc vs PremDesc correspondence
match_code_desc%>%mutate(ClassCode=as.factor(ClassCode))->match_code_desc
merge(byKYbyPREM, match_code_desc, by.x='ClassCode', by.y='ClassCode')->byKYbyPREM_match
#note: na.rm must be an argument to mean(), not a separate summarise() column
byKYbyPREM_match%>%group_by(OffenseDesc)%>%dplyr::summarise(mean=mean(count,na.rm=TRUE))%>%arrange(dplyr::desc(mean))->desc_desc_cnt
byKYbyPREM_match%>%group_by(PremDesc)%>%dplyr::summarise(mean=mean(count,na.rm=TRUE))%>%arrange(dplyr::desc(mean))->PREM_desc_cnt
byKYbyPREM_match%>%
ggplot(aes(fct_relevel(as.factor(OffenseDesc),as.character(desc_desc_cnt$OffenseDesc[sort(desc_desc_cnt$mean,index.return=TRUE,decreasing=TRUE)$ix])),
fct_relevel(as.factor(PremDesc),as.character(PREM_desc_cnt$PremDesc[sort(PREM_desc_cnt$mean,index.return=TRUE,decreasing=TRUE)$ix])),fill=pct))+scale_fill_gradientn(colors=c("red","orange","yellow","green","blue","violet"),na.value="black")+
scale_x_discrete(label=function(x) abbreviate(x, minlength=20))+coord_flip()+geom_tile(color="white",size=0.25)+
theme(axis.text.x = element_text(size=3,angle = 45, hjust =1), axis.text.y=element_text(size=4), legend.position="top",legend.text=element_text(size=6,hjust=0.5),legend.title=element_text(size=8),
legend.key = element_rect(size = 0.5),legend.key.size = unit(1, 'lines'))+ylab("Premises")+xlab("OffenseDesc")

- There does not appear to be a strong association between crime types and premises.
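This visual impression could be backed with a single number. A sketch of Cramér's V, a chi-square-based association measure for two categorical variables; computing it on `table(crime_df$ClassCode, crime_df$PremDesc)` (an assumption about how one would apply it here) would quantify the weak association seen in the heatmap.

```r
# Cramér's V, scaled to [0, 1]: 0 = independent, 1 = perfect association.
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab)$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  sqrt(as.numeric(chi2) / (n * k))
}

# toy: columns are proportional, i.e. exact independence -> V = 0
tab <- matrix(c(10, 20, 30, 20, 40, 60), nrow = 3)
cramers_v(tab)  # 0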
#how different crime types are associated with time of day (heatmap)
crime_df%>%
select(ClassCode,TimeStart)%>%
filter(!is.na(TimeStart))%>%
mutate(ClassCode=as.factor(ClassCode))%>%
mutate(Hour=as.factor(substr(TimeStart,1,2)))%>%
group_by(ClassCode,Hour)%>%dplyr::summarise(count=n())%>%mutate(pct=count/sum(count))->byKYbyFRTM
#combining hour 00 and hour 24 into hour 00
byKYbyFRTM$Hour[byKYbyFRTM$Hour=="24"]<-"00"
byKYbyFRTM$Hour<-factor(byKYbyFRTM$Hour)
#merging to get OffenseDesc vs TimeStart correspondence
merge(byKYbyFRTM, match_code_desc, by.x='ClassCode', by.y='ClassCode')->byKYbyFRTM_match
byKYbyFRTM_match%>%group_by(OffenseDesc)%>%dplyr::summarise(mean=mean(count,na.rm=TRUE))->desc2_desc_cnt
byKYbyFRTM_match%>%group_by(Hour)%>%dplyr::summarise(mean=mean(count,na.rm=TRUE))->Hour_desc_cnt
byKYbyFRTM_match%>%ggplot(aes(
fct_relevel(as.factor(OffenseDesc),as.character(desc2_desc_cnt$OffenseDesc[sort(desc2_desc_cnt$mean,index.return=TRUE,decreasing=TRUE)$ix])),
Hour,fill=pct))+scale_fill_gradientn(colors=c("red","orange","yellow","green","blue","violet"),na.value="black")+
scale_x_discrete(label=function(x) abbreviate(x, minlength=20))+coord_flip()+geom_tile(color="white",size=0.25)+
theme(axis.text.x = element_text(size=5, hjust = 0.5),axis.text.y=element_text(size=4),legend.position="bottom",
legend.text=element_text(size=6,hjust=0.5),legend.title=element_text(size=8),legend.key = element_rect(size = 0.5),legend.key.size = unit(1, 'lines'))+ylab("Hour")+xlab("OffenseDesc")

A few observations from this heat map:
* That dark, blue/purple box represents “Loitering/DeviateSex” at 4am, with a green box at 3am. Just a coincidence that as bars are getting set to close, and after they close, we see a lot of people milling around, “loitering”?
* We can also see a peak for “Intoxicated/Impaired” in the 3am hour, just before bars close.
* There is an odd pattern where “Under the Influence of Drugs” bunches up in three specific hours: midnight, 9am, and 6pm.
* Generally, most crime categories show a relatively even distribution across the clock, with higher proportions in the afternoon and lower proportions in the early morning, between 5 and 7am.
Why: what factors may contribute to more (or less) crime?
We came up with a series of ideas of factors that may contribute to the volume of crime. Each required we find a daily dataset with variables we could add to our main dataset.
We began with temperature data from NOAA.
- Temperature: Hotter vs. colder
This graph shows that in nearly all precincts, fewer crimes are committed on cold days than on hot days. We can also check whether the effect differs by level of crime.
- Precipitation: We continue with the weather data to see whether rain makes a difference in crime. It turns out that rain does seem to reduce crime somewhat, as the fitted slope is negative.
########################################################################################
#Read in weather data from file
weather_select = c("DATE", "AWND", "PRCP", "SNOW", "TMAX")
weather_data <- fread("../Data_Files/nyc_weather_data.csv", na.strings="", select = weather_select, stringsAsFactors = FALSE)
weather_data$DATE <- as.Date(weather_data$DATE)
weather_data$AWND <- as.numeric(weather_data$AWND)
weather_data$PRCP <- as.numeric(weather_data$PRCP)
weather_data$SNOW <- as.numeric(weather_data$SNOW)
weather_data$TMAX <- as.numeric(weather_data$TMAX)
#Merge the weather data in (crime_w_df is assumed to be a data.table copy of crime_df)
crime_w_df <- crime_w_df[weather_data, on=.(DateStart = DATE)]
###############################################################################
#Read in moon phase data
moon_data <- fread("../Data_Files/nyc_moon_data.csv", na.strings="", select = c("date", "phase"), stringsAsFactors = FALSE)
moon_data$date <- as.Date(moon_data$date, format='%m/%d/%Y')
moon_data$phase <- as.factor(moon_data$phase)
full_moon_data <- moon_data %>% filter(phase == "Full Moon")
#Merge the moon phase data into the main data frame
crime_w_df <- crime_w_df %>% left_join(moon_data, by = c("DateStart" = "date"))
#Generate violent crime dataframe
#filter for violent crime
violent_crime_df <- crime_w_df %>% filter(OffenseDesc == "ASSAULT 3 & RELATED OFFENSES" |
OffenseDesc == "FELONY ASSAULT" |
OffenseDesc == "MURDER & NON-NEGL. MANSLAUGHTER" |
OffenseDesc == "RAPE" |
OffenseDesc == "ROBBERY" |
IntOffenseDesc == "AGGRAVATED SEXUAL ASBUSE" |
IntOffenseDesc == "ASSAULT 2,1,UNCLASSIFIED" |
IntOffenseDesc == "ASSAULT 3" |
IntOffenseDesc == "RAPE 1" |
IntOffenseDesc == "ROBBERY,OPEN AREA UNCLASSIFIED" |
IntOffenseDesc == "SEXUAL ABUSE" |
IntOffenseDesc == "SEXUAL ABUSE 3,2")
#Generate scatterplot of crime vs precipitation
rain_summary_per_day <- crime_w_df %>% group_by(DateStart, Level) %>% summarize(Count = n()) %>% drop_na()
#append weather data
rain_summary_per_day <- rain_summary_per_day %>% left_join(weather_data, by = c("DateStart" = "DATE")) %>% select(DateStart, Level, Count, PRCP)
#Scatter plot of daily crimes vs. precipitation level
#Filter on Level of crime and generate linear model for each
# linear model: Felonies
f_df_rain <- rain_summary_per_day %>% filter(Level=="FELONY")
flm_rain <- lm(Count~PRCP, f_df_rain)
# linear model: Misdemeanors
m_df_rain <- rain_summary_per_day %>% filter(Level=="MISDEMEANOR")
mlm_rain <- lm(Count~PRCP, m_df_rain)
# linear model: Violation
v_df_rain <- rain_summary_per_day %>% filter(Level=="VIOLATION")
vlm_rain <- lm(Count~PRCP, v_df_rain)
#Plot data for all three crime levels vs precipitation with linear model results
ggplot(rain_summary_per_day, aes(x=PRCP, y=Count, color=Level)) +
geom_point(alpha=0.3) +
geom_abline(slope=flm_rain[["coefficients"]][["PRCP"]],intercept=flm_rain[["coefficients"]][["(Intercept)"]]) +
annotate("text", x= 3, y=420, label=paste0("y=",round(flm_rain[["coefficients"]][["PRCP"]],2),"x+",round(flm_rain[["coefficients"]][["(Intercept)"]],0))) +
geom_abline(slope=mlm_rain[["coefficients"]][["PRCP"]],intercept=mlm_rain[["coefficients"]][["(Intercept)"]]) +
annotate("text", x= 3, y=750, label=paste0("y=",round(mlm_rain[["coefficients"]][["PRCP"]],2),"x+",round(mlm_rain[["coefficients"]][["(Intercept)"]],0))) +
geom_abline(slope=vlm_rain[["coefficients"]][["PRCP"]],intercept=vlm_rain[["coefficients"]][["(Intercept)"]]) +
annotate("text", x= 3, y=180, label=paste0("y=",round(vlm_rain[["coefficients"]][["PRCP"]],2),"x+",round(vlm_rain[["coefficients"]][["(Intercept)"]],0))) +
labs(x = "Precipitation [inches]", y = "Daily Crime Incident Count", title = "Daily Crime Counts vs. Precipitation by Level of Crime with Linear Models")

#Plot data only for violent crimes vs precipitation with linear model result
#Generate scatterplot of crime vs precipitation
vc_df_rain <- violent_crime_df %>% group_by(DateStart) %>% summarize(Count = n()) %>% drop_na()
#append weather data
vc_df_rain <- vc_df_rain %>% left_join(weather_data, by = c("DateStart" = "DATE")) %>% select(DateStart, Count, PRCP)
#Scatter plot of daily crimes vs. precipitation level
#Generate linear model for Violent Crime vs. Precipitation
vclm_rain <- lm(Count~PRCP, vc_df_rain)
ggplot(vc_df_rain, aes(x=PRCP, y=Count)) +
geom_point(alpha=0.3) +
geom_abline(slope=vclm_rain[["coefficients"]][["PRCP"]],intercept=vclm_rain[["coefficients"]][["(Intercept)"]]) +
annotate("text", x= 3.5, y=240, label=paste0("y=",round(vclm_rain[["coefficients"]][["PRCP"]],2),"x+",round(vclm_rain[["coefficients"]][["(Intercept)"]],0))) +
labs(x = "Precipitation [inches]", y = "Daily Violent Crime Incident Count", title = "Daily Violent Crime Counts vs. Precipitation with Linear Model")

We can clearly see a reduction in the average number of crimes reported on days with significant precipitation. This is not surprising, since we would expect fewer people to be out in bad weather. The effect of increasing precipitation appears strongest for misdemeanors, with a smaller impact on felonies and violations. When we repeat the analysis filtered to violent crimes only, we see a relationship very similar to that of felonies, which is what we would expect given that most violent crimes are felonies.
Much of the data falls on days with no rain, and the graph shows a fairly wide range of daily crime counts on those days. As the amount of rain registers, however, the density of points shifts downward, indicating that more rain means less crime. The pattern for violent crimes is similar: less violent crime on rainy days.
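The plots show the fitted lines but not whether the negative slopes are distinguishable from noise. Below is a self-contained sketch, on synthetic daily counts rather than the real data, of how to read the slope estimate and its p-value from `lm`; the same `coef(summary(...))` call applies to the models fitted on `rain_summary_per_day` above.

```r
set.seed(1)
# Synthetic stand-in for one crime level: mostly dry days,
# with a mild negative effect of precipitation on the daily count.
prcp  <- c(rep(0, 200), runif(100, 0, 3))
count <- 700 - 15 * prcp + rnorm(300, sd = 20)
fit   <- lm(count ~ prcp)

# Slope estimate and its p-value; a significantly negative slope
# supports "more rain, less crime".
coef(summary(fit))["prcp", c("Estimate", "Pr(>|t|)")]
```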
We brought in data on the phases of the moon, just to test the idea that with a full moon we might see more crime. Articles such as in Decoded Science (https://www.decodedscience.org/full-moons-crime-aka-lunar-effect-real-deal-pseudoscience/41881) examine the issue, but we thought we could examine it in our dataset.
moon_summary <- crime_w_df %>%
filter(phase == "Full Moon" | phase == "New Moon" | phase == "First Quarter" | phase == "Last Quarter") %>%
group_by(DateStart, phase) %>% summarize(Count = n()) %>% drop_na()
moon_avg_crime <- moon_summary %>% group_by(phase) %>% summarize(Avg_Count = weighted.mean(Count))
moon_total_crime <- moon_summary %>% group_by(phase) %>% summarize(Total_Count = sum(Count))
moon_phase_total_count <- sum(moon_total_crime$Total_Count)
moon_total_crime <- moon_total_crime %>% mutate(Pct = Total_Count/moon_phase_total_count)
#Create a pie chart
blank_theme <- theme_minimal()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank(),
plot.title=element_text(size=14, face="bold")
)
ggplot(data = moon_total_crime, aes(x="", y = Total_Count, fill = phase)) +
geom_bar(width = 1, stat = "identity") +
coord_polar(theta = "y", start=0) +
blank_theme +
theme(axis.text.x=element_blank()) +
geom_text(aes(label = scales::percent(Pct)), position = position_stack(vjust = 0.5)) +
ggtitle("Moon Phase vs. Crime Count Analysis")

#Generate the same analysis based on Violent Crimes
vc_moon_summary <- violent_crime_df %>%
filter(phase == "Full Moon" | phase == "New Moon" | phase == "First Quarter" | phase == "Last Quarter") %>%
group_by(DateStart, phase) %>% summarize(Count = n()) %>% drop_na()
vc_moon_avg_crime <- vc_moon_summary %>% group_by(phase) %>% summarize(Avg_Count = weighted.mean(Count))
vc_moon_total_crime <- vc_moon_summary %>% group_by(phase) %>% summarize(Total_Count = sum(Count))
vc_moon_phase_total_count <- sum(vc_moon_total_crime$Total_Count)
vc_moon_total_crime <- vc_moon_total_crime %>% mutate(Pct = Total_Count/vc_moon_phase_total_count)
ggplot(data = vc_moon_total_crime, aes(x="", y = Total_Count, fill = phase)) +
geom_bar(width = 1, stat = "identity") +
coord_polar(theta = "y", start=0) +
blank_theme +
theme(axis.text.x=element_blank()) +
geom_text(aes(label = scales::percent(Pct)), position = position_stack(vjust = 0.5)) +
ggtitle("Moon Phase vs. Violent Crime Count Analysis")

We can see from the data that crime does not appear to be affected significantly on days with a full moon. We conclude that there is no statistical evidence in this dataset to support the “old wives’ tale” about lunatics.
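A formal version of this check is a chi-square goodness-of-fit test of the per-phase totals against equal expected shares (over 11 years the four phases occur on nearly equal numbers of days). The counts below are hypothetical stand-ins; substitute `moon_total_crime$Total_Count` from the chunk above.

```r
# Hypothetical per-phase totals (stand-ins for moon_total_crime$Total_Count)
phase_counts <- c(FullMoon = 1520, NewMoon = 1498,
                  FirstQuarter = 1510, LastQuarter = 1505)

# H0: crime is spread evenly across the four phases;
# a large p-value means no evidence of a lunar effect.
chisq.test(phase_counts)
```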
- Do we see any association between time of day and certain crimes? There is some high density in the upper-middle-right area, consistent with the daily cycle seen in the bar charts.
#matching Pct with Boro
crime_df %>% select(Level,Pct)%>%group_by(Level,Pct)%>%
drop_na()%>%dplyr::summarize(count = n())%>%ungroup()->df_pct
merge(df_pct,match_pct_boro,by.x="Pct",by.y="Pct")->df_pbl
df_pbl%>%mutate(Pct=as.factor(Pct))->df_pbl
df_pbl%>%ggplot(aes(reorder(Pct, count), count,fill=Boro)) + geom_bar(stat = "identity") + xlab("Precinct Number") + ggtitle("Incidents by Precinct") + coord_flip()+scale_fill_manual(values = c("red","orange","yellow","green","blue","violet"))+theme(axis.text.x = element_text(size=5, hjust = 0.5),axis.text.y=element_text(size=4),legend.position=c(0.6,0.2),
legend.text=element_text(size=6,hjust=0.5),legend.title=element_text(size=8),legend.key = element_rect(size = 0.5),legend.key.size = unit(1, 'lines'))->p5
df_pbl%>%ggplot(aes(reorder(Pct, count), count,fill=Level)) + geom_bar(stat = "identity") + xlab("Precinct Number") + ggtitle("Incidents by Precinct") + coord_flip()+theme(axis.text.x = element_text(size=5, hjust = 0.5),axis.text.y=element_text(size=4),legend.position=c(0.6,0.2),
legend.text=element_text(size=6,hjust=0.5),legend.title=element_text(size=8),legend.key = element_rect(size = 0.5),legend.key.size = unit(1, 'lines'))->p6
grid.arrange(p5,p6,nrow=1)

- Crime frequency by precinct, colored by borough name (left) and by crime Level (right).
crime_df%>%filter(!is.na(ParkName))->df_pk
df_pk%>%select(Level)%>%group_by(Level)%>%dplyr::summarise(count=n())%>%mutate(RelFreq = count/sum(count))%>%ggplot(aes(Level,RelFreq))+geom_bar(stat="identity")+
coord_flip()+xlab("Level of Offense")+ylab("Relative Frequency") #labels match the flipped axes

* About 12,538 cases are recorded as having occurred in parks, playgrounds, or green spaces. The crime-level distribution shares the same pattern as the overall data.
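The "same pattern" claim can be checked with a goodness-of-fit test of the park-only Level counts against the overall Level shares. The counts and shares below are illustrative stand-ins; plug in the real tallies from `df_pk` and `crime_df`.

```r
# Illustrative park counts by Level (stand-ins; sum = 12538)
park_counts   <- c(FELONY = 3760, MISDEMEANOR = 7150, VIOLATION = 1628)
# Illustrative overall Level shares (stand-ins; must sum to 1)
overall_share <- c(0.30, 0.57, 0.13)

# H0: park crimes follow the overall Level distribution;
# a large p-value means the park distribution matches the overall one.
chisq.test(park_counts, p = overall_share)
```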